Motivation
In the hourly concentrations plotted below for 2013 in Lausanne (LAU) and Zurich (ZUE), we can identify several extreme values. We will focus in particular on the PM\(_{10}\) and EC concentrations in Zurich and examine them further.
Outline
Given the problem statement, “describe differences and reasons for variations in pollutant concentrations across sites and seasons,” we have adopted the following strategy:
We will discuss methods for handling observations which fall outside of the general patterns reflected by the aggregate measures (summary statistics) of pollutant concentrations considered so far. These surprising data points may arise from experimental error, anomalous activity, or unusual meteorology.
For a thorough treatment of this topic, see Barnett and Lewis (1994) from which much of this material has been taken.
In practice, univariate or multivariate observations that fall outside of the general patterns (“extreme values”) are informally separated and removed, or reserved for further analysis.
We will discuss statistical justifications for classifying extreme values as “outliers” in the univariate case, which will provide an appropriate perspective for thinking about extreme observations in the general case.
Definitions:
The decision to label an observation as an outlier is made with consideration to the greater set of observations to which it belongs.
Treatment of outliers
Hypothesis testing framework
We will discuss some conceptual ways to think about extreme values and outliers.
The statistical basis for classifying an extreme value as an outlier originates from a hypothesis testing framework.
| Decision | \(H_0\) True | \(H_a\) True |
|---|---|---|
| reject \(H_0\) | Type I error (\(\alpha\)) | Correct decision |
| do not reject \(H_0\) | Correct decision | Type II error (\(\beta\)) |
A discordant observation is one that is statistically unreasonable with respect to a prescribed probability model.
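To make “statistically unreasonable” concrete, here is a minimal base-R sketch (the sample size \(n=24\), i.e., one day of hourly values, is an assumption chosen for illustration): under \(H_0\) that all \(n\) observations are independent draws from a normal distribution, the probability that the sample maximum lies beyond \(z\) standard deviations is \(1 - \Phi(z)^n\).

```r
## Probability that the maximum of n iid standard-normal samples
## exceeds z standard deviations: 1 - pnorm(z)^n.
## n = 24 (one day of hourly values) is an illustrative assumption.
p.max.exceeds <- function(z, n) 1 - pnorm(z)^n
p.max.exceeds(3, 24)   # ~ 0.032: a 3-sigma maximum among 24 samples is not rare
p.max.exceeds(5, 24)   # ~ 7e-6: a 5-sigma maximum would be discordant
```

A value that would be unremarkable in a large sample can thus be discordant in a small one; the judgment always depends on \(n\) and the assumed model.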
Conceptual framework
Statistical tests require a comparison of a working (null) hypothesis and an opposing alternative hypothesis.
Working hypothesis:
The data arise from some common but unspecified distribution \(F\) (i.e., a population characterized by the distribution \(F\)). \[ H_0: F \]
Some alternative hypotheses:
Deterministic alternative. Instead of all observations arising from \(F\), all samples \(x_i\) with \(i \neq j\) arise from \(F\), and \(x_j\) is an isolated and sufficiently different type of observation which requires rejection. \[ H_a: \{x_i \sim F : i \neq j\}; \ \text{$x_j$ does not belong to $F$} \] This alternative is “deterministically correct” in the gross measurement error scenario.
Mixture alternative. The sample does not all belong to the population characterized by \(F\), but to a mixture of two populations characterized by \(F\) and \(G\), with mixing fraction \(\lambda\). Samples from population \(G\) appear as extreme values in a data set mostly comprised of samples from \(F\). \[ H_a: (1-\lambda) F + \lambda G \] \(\lambda\) would necessarily be small; therefore, estimating \(\lambda\) and \(G\) would typically be difficult in real data sets, but the conceptual idea is that samples from two separate distributions are present in the sample.
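As an illustration (the choices \(F = N(0,1)\), \(G = N(10,1)\), and \(\lambda = 0.02\) are assumptions made for clarity, not estimated from our data), we can simulate samples from such a mixture and see that the few draws from \(G\) appear as extreme values:

```r
## Simulate n draws from the mixture (1 - lambda) F + lambda G,
## with F = N(0, 1), G = N(10, 1), lambda = 0.02 (illustrative choices).
set.seed(1)
n <- 1000
lambda <- 0.02
from.G <- runif(n) < lambda                       # indicator: sample drawn from G
x <- ifelse(from.G, rnorm(n, mean=10), rnorm(n))  # the mixture sample
sum(from.G)          # number of draws from G (close to lambda * n)
range(x[!from.G])    # the bulk of the data, from F
range(x[from.G])     # the "extreme values", from G
```

Given only `x` (without the `from.G` labels), recovering \(\lambda\) and \(G\) is the hard estimation problem alluded to above.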
There are several types of metrics, three of which are shown here with example implementations.
For a thorough treatment of this topic, see Venables and Ripley (2003) from which much of this material has been taken.
Why will it not suffice to screen data and remove outliers? (Venables and Ripley 2003)
For the sample mean.
Let \[ \tilde{\mu} = \sum_i c_i x_{(i)} \] where the \(x_{(i)}\) are the order statistics of the sample and the weights \(c_i\) sum to one.
(For further generalization of these concepts, read further on “robust statistics”; \(L\)- and \(M\)-estimators.)
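For instance (the data vector and trim fraction below are illustrative), the 10%-trimmed mean is the \(L\)-estimator with \(c_i = 0\) for the lowest and highest \(\lfloor 0.1\,n \rfloor\) order statistics and equal weights on the rest:

```r
## The trimmed mean expressed as a weighted sum of order statistics x_(i).
x <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 100)   # 100 is an extreme value
n <- length(x)
k <- floor(0.1 * n)                  # observations zero-weighted at each end
ci <- rep(0, n)
ci[(k + 1):(n - k)] <- 1/(n - 2*k)   # equal weights on the central values
sum(ci * sort(x))                    # weighted order statistics: 12.25
mean(x, trim=0.1)                    # R's built-in trimmed mean: 12.25
mean(x)                              # ordinary mean, pulled up by 100: 20
```

The single extreme value moves the ordinary mean well away from the bulk of the data, while the trimmed mean ignores it entirely.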
For the sample standard deviation.
There are many other types of robust estimators for these parameters; also for regression coefficients, etc.
I.e., why not use robust estimators all the time?
(Note that an estimator is not the same as an estimate, which is only a single realization of an estimator.)
Generally, estimators which are unbiased with low variance are desired.
These properties have consequences for conclusions we draw from hypothesis testing (and problems in statistical inference in general). For descriptive statistics, high variance of an estimator may be less of an issue (though the bias of any single estimate may be greater).
Unbiased estimators have the property that the difference between the expected value of the estimator and the true parameter is zero: \[ B(\hat{\theta}) = \operatorname{E}(\hat{\theta}) - \theta = 0 \] We can describe the variance of an estimator \(\tilde{\theta}\) against another (benchmark) estimator \(\hat{\theta}\) by its asymptotic relative efficiency. \[ \textit{ARE}(\tilde{\theta};\hat{\theta}) = \lim_{n\to\infty} \frac{{\operatorname{Var}}(\hat{\theta})}{{\operatorname{Var}}(\tilde{\theta})} \] For example, some AREs for random variables with specific distributions are shown below:
| Distribution | ARE |
|---|---|
| normal | \(\textit{ARE}({\operatorname{median}}; {\operatorname{mean}}) \approx 64\%\) |
| Student’s \(t\) | \(\textit{ARE}({\operatorname{median}}; {\operatorname{mean}}) \approx 96\%\) |
| normal | \(\textit{ARE}(MAD;\text{std.~dev.}) \approx 37\%\) |
The median performs better for longer-tailed distributions; the MAD (and IQR) are robust with respect to outliers, but not very efficient.
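The first entry in the table can be checked by simulation (the sample size and replicate count below are arbitrary choices): for normal data, the sampling variance of the median is about \(\pi/2\) times that of the mean, giving \(\textit{ARE} = 2/\pi \approx 64\%\).

```r
## Monte Carlo check of ARE(median; mean) for normal data.
## n and nrep are illustrative; the ratio approaches 2/pi ~ 0.637.
set.seed(2)
n <- 101
nrep <- 2000
means   <- replicate(nrep, mean(rnorm(n)))
medians <- replicate(nrep, median(rnorm(n)))
var(means) / var(medians)    # ~ 0.64
```

Repeating the experiment with a long-tailed distribution (e.g., `rt(n, df=3)` in place of `rnorm(n)`) shows the ratio moving above 1, i.e., the median becoming the more efficient estimator.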
There are other estimators with properties which are desirable in certain contexts, and the user should be aware of these properties and select according to specific needs (for descriptive and inferential statistics).
library(dplyr)
library(reshape2)
library(chron)
library(ggplot2)
source("GRB001.R")
Sys.setlocale("LC_TIME","C")
options(stringsAsFactors=FALSE)
options(chron.year.abb=FALSE)
theme_set(theme_bw()) # just my preference for plots
df <- readRDS("data/2013/lau-zue.rds")
Pivot to long format:
id.vars <- c("site", "datetime", "year", "month", "day", "hour", "season", "dayofwk", "daytype")
lf <- melt(df, id.vars=id.vars)
Select PM\(_{10}\) for Zurich.
pm10.zue <- filter(lf, site=="ZUE" & variable=="PM10")
pm10.zue[["date"]] <- format(dates(pm10.zue[["datetime"]]), "y.m.d")
pm10.zue[["day"]] <- as.numeric(as.character(pm10.zue[["day"]]))
View the time series.
ggp <- ggplot(mutate(pm10.zue, day.of.month=day+hour/24))+
geom_line(aes(day.of.month, value))+
facet_grid(.~month)+
xlab("Day")+
ylab(expression("PM"[10]~"concentration"~(mu*g/m^3)))
print(ggp)
View the PDF/ECDF.
ggp <- ggplot(pm10.zue, aes(x=value))+
geom_line(stat="density", position="identity")+
geom_point(y=0, shape=4)+
xlab(expression("Hourly PM"[10]~"concentration"~(mu*g/m^3)))+
ylab("Probability density")
print(ggp)
ggp <- ggplot(pm10.zue, aes(value))+
geom_line(stat="ecdf")+
geom_point(stat="ecdf")+
xlab(expression("Hourly PM"[10]~"concentration"~(mu*g/m^3)))+
ylab("Probability")
print(ggp)
We select periods for which the concentration was greater than 100 \(\mu\)g/m\(^3\).
(highvals <- filter(pm10.zue, value > 100))
## site datetime year month day hour season dayofwk daytype variable value
## 1 ZUE 15706.08 2013 Jan 1 2 DJF Tue Weekday PM10 197.5
## 2 ZUE 15706.12 2013 Jan 1 3 DJF Tue Weekday PM10 173.6
## 3 ZUE 15706.17 2013 Jan 1 4 DJF Tue Weekday PM10 145.1
## 4 ZUE 15706.21 2013 Jan 1 5 DJF Tue Weekday PM10 119.5
## 5 ZUE 15918.96 2013 Aug 1 23 JJA Thu Weekday PM10 112.8
## 6 ZUE 15919.00 2013 Aug 2 0 JJA Fri Weekday PM10 146.8
## 7 ZUE 15919.04 2013 Aug 2 1 JJA Fri Weekday PM10 130.2
## date
## 1 2013.01.01
## 2 2013.01.01
## 3 2013.01.01
## 4 2013.01.01
## 5 2013.08.01
## 6 2013.08.02
## 7 2013.08.02
Notable dates:
View the diurnal variations for these days.
highdays <- filter(pm10.zue, date %in% highvals[["date"]])
ggp <- ggplot(highdays, aes(hour, value))+
geom_line()+
geom_point()+
facet_grid(.~date)+
xlab("Hour of day")+
ylab(expression("PM"[10]~"concentration"~(mu*g/m^3)))
print(ggp)
We see that despite these periods of high concentration, the mean daily values are still below the limit of 50 \(\mu\)g/m\(^3\).
highdays %>% group_by(date, variable) %>%
summarize(value=mean(value))
## Source: local data frame [3 x 3]
## Groups: date [?]
##
## date variable value
## (chr) (fctr) (dbl)
## 1 2013.01.01 PM10 46.65833
## 2 2013.08.01 PM10 20.94583
## 3 2013.08.02 PM10 39.82083
As a rather uninteresting example, we can use a trimmed mean as a robust estimator of the mean. Notice that the trimmed mean is less influenced by the presence of extreme values. (Note that 10% corresponds to \(\lfloor{24\times{}0.1}\rfloor=2\) observations removed from each end.)
highdays %>% group_by(date, variable) %>%
summarize(value=mean(value, trim=0.1))
## Source: local data frame [3 x 3]
## Groups: date [?]
##
## date variable value
## (chr) (fctr) (dbl)
## 1 2013.01.01 PM10 36.995
## 2 2013.08.01 PM10 16.650
## 3 2013.08.02 PM10 32.900
Under what conditions do PM\(_{10}\) concentrations exceed regulatory limits?
Calculate daily means.
daily <- pm10.zue %>%
group_by(date, variable) %>%
summarize(month=month[1],
day=day[1],
value=mean(value,na.rm=TRUE))
View the time series.
ggp <- ggplot(daily)+
geom_line(aes(day, value))+
facet_grid(.~month)+
xlab("Day")+
ylab(expression("PM"[10]~"concentration"~(mu*g/m^3)))
print(ggp)
View PDF/ECDF of daily means.
ggp <- ggplot(daily)+
geom_line(aes(x=value), stat="density", position="identity")+
geom_point(aes(x=value), y=0, shape=4)+
geom_vline(xintercept=50, linetype=2)+
xlab(expression("Daily Mean PM"[10]~"concentration"~(mu*g/m^3)))+
ylab("Probability density")
print(ggp)
ggp <- ggplot(daily, aes(value))+
geom_line(stat="ecdf")+
geom_point(stat="ecdf")+
geom_vline(xintercept=50, linetype=2)+
xlab(expression("Daily Mean PM"[10]~"concentration"~(mu*g/m^3)))+
ylab("Probability")
print(ggp)
If we observe concentrations for days in which the 50 \(\mu\)g/m\(^3\) limit is exceeded, we find that the concentrations are generally high throughout the day, and persist for more than a day at a time.
exceeded <- filter(daily, value > 50)
ggp <- ggplot(filter(pm10.zue, date %in% exceeded[["date"]]), aes(hour, value))+
geom_line()+
geom_point()+
facet_grid(.~date)+
xlab("Hour of day")+
ylab(expression("PM10 concentration"~(mu*g/m^3)))+
scale_y_continuous(limits=c(0,100))
print(ggp)
We can more generally summarize extreme values that exceed the daily limit values set forth by regulation (in Switzerland).
limits.daily <- data.frame(value=c(100,80,8,50),
variable=c("SO2","NO2","CO","PM10"))
Let us compute the daily means, but also note the percent recovery of data.
CalculateMeans <- function(x)
data.frame(percent.recovery=length(na.omit(x))/length(x)*1e2,
value=mean(x, na.rm=TRUE))
daily <- lf %>% filter(variable %in% limits.daily[["variable"]]) %>%
mutate(date = dates(datetime)) %>%
group_by(site, date, variable) %>%
do(CalculateMeans(.[["value"]]))
We select a threshold recovery of 75%:
threshold <- 75
Let us see how many days the data recovery is at or below this threshold for each variable:
dcast(filter(daily, percent.recovery <= threshold),
site ~ variable, length)
## site PM10 SO2
## 1 LAU 6 367
## 2 ZUE 8 0
Lausanne does not have an SO2 monitor, so this makes sense, and we are only missing a few days of PM\(_{10}\) measurements at each site. We can see which dates:
filter(daily, percent.recovery <= threshold & variable=="PM10")
## Source: local data frame [14 x 5]
## Groups: site, date, variable [14]
##
## site date variable percent.recovery value
## (chr) (dats) (fctr) (dbl) (dbl)
## 1 LAU 06/30/2013 PM10 62.500000 20.513333
## 2 LAU 07/01/2013 PM10 41.666667 21.950000
## 3 LAU 07/02/2013 PM10 25.000000 17.200000
## 4 LAU 10/24/2013 PM10 45.833333 11.190909
## 5 LAU 10/25/2013 PM10 45.833333 30.081818
## 6 LAU 11/14/2013 PM10 4.166667 30.200000
## 7 ZUE 05/30/2013 PM10 62.500000 8.780000
## 8 ZUE 05/31/2013 PM10 45.833333 2.872727
## 9 ZUE 06/07/2013 PM10 33.333333 16.275000
## 10 ZUE 06/08/2013 PM10 0.000000 NaN
## 11 ZUE 06/09/2013 PM10 0.000000 NaN
## 12 ZUE 06/10/2013 PM10 0.000000 NaN
## 13 ZUE 06/11/2013 PM10 41.666667 11.900000
## 14 ZUE 11/14/2013 PM10 4.166667 16.400000
Let us visualize the time series with limit values indicated for each variable:
ggplot()+
geom_line(aes(x=date, y=value),
data=daily %>% filter(percent.recovery > threshold))+
geom_hline(aes(yintercept=value),
data=limits.daily,
linetype=2)+
facet_grid(variable~site, scale="free_y")+
scale_x_chron(format="%d.%m")+
theme(axis.text.x=element_text(angle=30, hjust=1))
We can also view in the form of its ECDF:
ggplot()+
geom_line(aes(x=value),
data=daily %>% filter(percent.recovery > threshold),
stat="ecdf")+
geom_point(aes(x=value),
data=daily %>% filter(percent.recovery > threshold),
stat="ecdf")+
geom_vline(aes(xintercept=value),
data=limits.daily,
linetype=2)+
facet_grid(variable~site, scale="free_y")
To select which days, we will use a lookup table as previously described. The following statement creates a named vector where each limit value (value) is labeled by its pollutant (variable):
(limits.vec <- with(limits.daily, setNames(value, variable)))
## SO2 NO2 CO PM10
## 100 80 8 50
We can then use this vector (which implements a lookup “table”) to select the dates which exceed the limit values for each variable:
exceedances <- daily %>%
filter(percent.recovery > threshold &
value > limits.vec[as.character(variable)])
We can export this table with the following command:
write.csv2(exceedances, file="exceedances.csv", row.names=FALSE)
Note that write.csv2 uses the European convention for comma-separated-value files, where the delimiter is actually a semicolon (;) rather than comma (,).
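A quick self-contained check of this convention (using a temporary file rather than exceedances.csv):

```r
## write.csv2 writes ";"-separated values with "," as the decimal mark;
## read.csv2 is the matching reader. A temporary file is used for illustration.
tmp <- tempfile(fileext=".csv")
write.csv2(data.frame(a=1.5, b=2), file=tmp, row.names=FALSE)
readLines(tmp)      # '"a";"b"' then '1,5;2'
read.csv2(tmp)      # recovers a = 1.5, b = 2
```

Reading such a file with plain `read.csv` would instead yield one mangled column, so the reader must match the writer's convention.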
In this section, we will examine EC (elemental carbon) concentrations from Zurich.
ec.zue <- filter(lf, site=="ZUE" & variable=="EC")
We can view its ECDF:
ggp <- ggplot(ec.zue, aes(x=value))+
geom_line(stat="ecdf")+
geom_point(stat="ecdf")+
xlab("EC concentration"~(mu*g/m^3))+
ylab("Probability")
print(ggp)
Let us select the day with the highest value:
anomolous.day <- filter(ec.zue, dates(datetime) == dates(datetime[which.max(value)]))
Visualizing the time series for this day only, we see that it happens late at night on 1 August:
ggp <- ggplot(anomolous.day, aes(hour, value))+
geom_line()+
geom_point()+
xlab(paste("Hour of day", dates(anomolous.day[1,"datetime"])))+
ylab("EC concentration"~(mu*g/m^3))
print(ggp)
Is this point clearly an outlier? It is often convenient to state a criterion by which a measurement is labeled as an outlier (rather than simply an extreme value). This does not necessarily mean it is a bad measurement, but that it falls outside of expected trends; such labels may be used to flag periods for further study.
R includes a Dixon test where the test statistic is compared against a distribution tabulated by Dixon. A \(p\)-value is provided with respect to this distribution, but only for small sample sets (<30)—presumably a limitation of the available tabulated values.
require(outliers)
test.out <- dixon.test(anomolous.day[["value"]], two.sided=FALSE)
print(test.out)
##
## Dixon test for outliers
##
## data: anomolous.day[["value"]]
## Q = 0.94017, p-value < 2.2e-16
## alternative hypothesis: highest value 11.9 is an outlier
As we observed, structured observations such as time series measurements carry additional information (for instance, the position of an observation in a sequence, or its relationship to other variables) that can further inform the assessment of the discordancy of an observation (or observations). All information that you have at your disposal should be used to make sense of these values.
We will compute several metrics for characterizing the central tendency and dispersion in a univariate sample. For this we define several functions.
Note that for a normal distribution, \(\textit{IQR} \approx 1.35\sigma\) and \(MAD \approx 0.6745\sigma\) (Venables and Ripley 2003). The mad function returns MAD/0.6745 rather than MAD itself.
Again for a normal distribution, the integrated area between \(\pm 1\sigma\) is approximately 68%, whereas for the IQR it is exactly 50% by definition (regardless of distribution).
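These relationships can be verified directly from the normal quantile function and a large simulated sample (the sample size below is arbitrary):

```r
## For the standard normal: IQR = qnorm(.75) - qnorm(.25) ~ 1.349,
## and MAD = qnorm(.75) ~ 0.6745 (the median of |X| for X ~ N(0, 1)).
qnorm(0.75) - qnorm(0.25)     # ~ 1.349
qnorm(0.75)                   # ~ 0.6745
## mad() rescales by 1/0.6745 (constant = 1.4826) so it estimates sigma:
set.seed(3)
x <- rnorm(1e5)
c(sd=sd(x), IQR.scaled=IQR(x)/1.349, mad=mad(x))   # all ~ 1
```

This is why the rescaled IQR and `mad()` can stand in directly for the standard deviation when the data are approximately normal.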
WinsorizedMean <- function (x, frac=.05, ...) {
## x is a vector of values
## frac is the fraction to replace at each end
lim <- quantile(x, c(frac/2, 1-frac/2), ...)
x[x < lim[1]] <- lim[1]
x[x > lim[2]] <- lim[2]
mean(x, ...)
}
ComputeCentral <- function(x) {
## x is a vector of values
metric <- c("mean" = mean(x, na.rm=TRUE),
"median" = median(x, na.rm=TRUE),
"trimmed.mean" = mean(x, trim=0.05, na.rm=TRUE),
"Winsorized.mean" = WinsorizedMean(x, frac=0.05, na.rm=TRUE))
data.frame(metric=factor(names(metric), names(metric)), value=metric)
}
ComputeDispersion <- function(x) {
## x is a vector of values
metric <- c("sd" = sd(x, na.rm=TRUE),
"IQR" = IQR(x, na.rm=TRUE),
"MAD" = mad(x, na.rm=TRUE))
data.frame(metric=factor(names(metric), names(metric)), value=metric)
}
Compare central metrics:
metric.central <- lf %>% group_by(site, variable) %>%
do(ComputeCentral(.[["value"]]))
ggp <- ggplot(metric.central)+
geom_bar(aes(metric, value), stat="identity", fill="gray")+
facet_grid(variable~site, scale="free_y")+
xlab("")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(ggp)
Compare dispersion metrics:
metric.disp <- lf %>% group_by(site, variable) %>%
do(ComputeDispersion(.[["value"]]))
ggp <- ggplot(metric.disp)+
geom_bar(aes(metric, value), stat="identity", fill="gray")+
facet_grid(variable~site, scale="free_y")+
xlab("")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(ggp)
Possible causes?
Strategies:
Barnett, Vic, and T. Lewis. 1994. Outliers in Statistical Data. Wiley Series in Probability & Statistics. Wiley.
Venables, W. N., and B. D. Ripley. 2003. Modern Applied Statistics with S. Springer.